Learning from Noisy Label Distributions
In this paper, we consider a novel machine learning problem:
learning a classifier from noisy label distributions. In this problem, each
instance, represented by a feature vector, belongs to at least one group. Instead of
the true label of each instance, we observe the label distribution of the
instances associated with a group, where the label distribution is distorted by
unknown noise. Our goals are to (1) estimate the true label of each
instance, and (2) learn a classifier that predicts the true label of a new
instance. We propose a probabilistic model that considers true label
distributions of groups and parameters that represent the noise as hidden
variables. The model can be learned with a variational Bayesian method. In
numerical experiments, we show that the proposed model outperforms existing
methods in estimating the true labels of instances.
Comment: Accepted in ICANN201
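As a rough illustration of this setting (not the authors' model), the following sketch simulates groups whose true label distributions are observed only through a simple, hypothetical mixing noise:

```python
import random

def simulate_group_observations(num_groups=5, group_size=50,
                                num_classes=3, noise=0.2, seed=0):
    """Toy generative sketch of the problem setup: each group's true
    label distribution is observed only through a noisy distortion."""
    rng = random.Random(seed)
    observations = []
    for _ in range(num_groups):
        # True labels of the instances in this group.
        labels = [rng.randrange(num_classes) for _ in range(group_size)]
        true_dist = [labels.count(c) / group_size for c in range(num_classes)]
        # Distort the distribution by mixing in uniform noise
        # (one simple, hypothetical noise model).
        noisy = [(1 - noise) * p + noise / num_classes for p in true_dist]
        observations.append((true_dist, noisy))
    return observations

obs = simulate_group_observations()
```

The learning problem is then to invert this distortion and recover per-instance labels, which the paper does with hidden variables for both the true distributions and the noise parameters.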
A Real-Time Remote IDS Testbed for Connected Vehicles
Connected vehicles are becoming commonplace. A constant connection between
vehicles and a central server enables new features and services. This added
connectivity raises the likelihood of exposure to attackers and risks
unauthorized access. Possible countermeasures to this issue are intrusion
detection systems (IDS), which aim to detect these intrusions during or
after their occurrence. The problem with IDS is the large variety of possible
approaches with no sensible option for comparing them. Our contribution to this
problem comprises the conceptualization and implementation of a testbed for an
automotive real-world scenario. That amounts to a server-side IDS detecting
intrusions into vehicles remotely. To verify the validity of our approach, we
evaluate the testbed from multiple perspectives, including its fitness for
purpose and the quality of the data it generates. Our evaluation shows that the
testbed makes the effective assessment of various IDS possible. It solves
multiple problems of existing approaches, including class imbalance.
Additionally, it enables reproducibility and generating data of varying
detection difficulties. This allows for comprehensive evaluation of real-time,
remote IDS.
Comment: Peer-reviewed version accepted for publication in the proceedings of
the 34th ACM/SIGAPP Symposium On Applied Computing (SAC'19
TreeGrad: Transferring Tree Ensembles to Neural Networks
Gradient Boosted Decision Trees (GBDTs) are popular machine learning
algorithms, with dedicated implementations such as LightGBM and support in
widely used toolkits like Scikit-Learn. Many implementations can only produce
trees in an offline, greedy manner. We explore ways to convert
existing GBDT implementations to known neural network architectures with
minimal performance loss in order to allow decision splits to be updated in an
online manner and provide extensions to allow split points to be altered as a
neural architecture search problem. We provide learning bounds for our neural
network.
Comment: Technical Report on Implementation of Deep Neural Decision Forests
Algorithm. To accompany implementation here:
https://github.com/chappers/TreeGrad. Update: Please cite as: Siu, C. (2019).
"Transferring Tree Ensembles to Neural Networks". International Conference on
Neural Information Processing. Springer, 2019. arXiv admin note: text overlap
with arXiv:1909.1179
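The core idea of such a conversion can be illustrated on a single decision split: a hard threshold becomes a sigmoid gate, making the split differentiable and hence updatable online. This is a minimal sketch of the general technique, not the paper's actual architecture:

```python
import math

def stump_predict(x, feature, threshold, left_value, right_value):
    # Hard decision split, as produced by a GBDT.
    return left_value if x[feature] <= threshold else right_value

def soft_stump_predict(x, feature, threshold, left_value, right_value,
                       steepness=50.0):
    # The same split as a differentiable "neuron": a sigmoid gate over
    # (x[feature] - threshold).  As steepness grows this recovers the
    # hard split, but the threshold is now a trainable parameter.
    gate = 1.0 / (1.0 + math.exp(-steepness * (x[feature] - threshold)))
    return (1.0 - gate) * left_value + gate * right_value

x = [0.2, 0.9]
hard = stump_predict(x, feature=1, threshold=0.5, left_value=-1.0, right_value=1.0)
soft = soft_stump_predict(x, feature=1, threshold=0.5, left_value=-1.0, right_value=1.0)
```

Stacking such gates per tree level, and summing tree outputs, yields a network whose architecture mirrors the ensemble while remaining trainable by gradient descent.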
On Using Active Learning and Self-Training when Mining Performance Discussions on Stack Overflow
Abundant data is the key to successful machine learning. However, supervised
learning requires annotated data that are often hard to obtain. In a
classification task with limited resources, Active Learning (AL) promises to
guide annotators to examples that bring the most value for a classifier. AL can
be successfully combined with self-training, i.e., extending a training set
with the unlabelled examples for which a classifier is the most certain. We
report our experiences on using AL in a systematic manner to train an SVM
classifier for Stack Overflow posts discussing performance of software
components. We show that the training examples deemed as the most valuable to
the classifier are also the most difficult for humans to annotate. Despite
carefully evolved annotation criteria, we report low inter-rater agreement, but
we also propose mitigation strategies. Finally, based on one annotator's work,
we show that self-training can improve the classification accuracy. We conclude
the paper by discussing implications for future text miners aspiring to use AL
and self-training.
Comment: Preprint of paper accepted for the Proc. of the 21st International
Conference on Evaluation and Assessment in Software Engineering, 201
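The two selection rules involved (uncertainty sampling for AL, confidence-based selection for self-training) can be sketched with a toy 1-D logistic model standing in for the paper's SVM; everything below is illustrative:

```python
import math

def train_logistic(points, labels, lr=0.5, epochs=200):
    """Minimal 1-D logistic regression (a stand-in for the SVM)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w += lr * (y - p) * x
            b += lr * (y - p)
    return w, b

def most_uncertain(points, w, b):
    """Active learning: pick the unlabelled point nearest the boundary."""
    return min(points, key=lambda x: abs(w * x + b))

def most_confident(points, w, b):
    """Self-training: pick the unlabelled point the model is surest about."""
    return max(points, key=lambda x: abs(w * x + b))

labelled = [(-2.0, 0), (-1.5, 0), (1.5, 1), (2.0, 1)]
w, b = train_logistic([x for x, _ in labelled], [y for _, y in labelled])
pool = [0.1, 3.0, -2.5]
query = most_uncertain(pool, w, b)    # near the boundary -> send to annotator
pseudo = most_confident(pool, w, b)   # far from the boundary -> pseudo-label
```

The paper's observation that AL queries are hard to annotate follows naturally: the points the model is least certain about tend to be the genuinely ambiguous ones.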
Representation learning for cross-modality classification
Differences in scanning parameters or modalities can complicate image analysis
based on supervised classification. This paper presents two representation
learning approaches, based on autoencoders, that address this problem by
learning representations that are similar across domains. Both approaches use,
next to the data representation objective, a similarity objective to minimise
the difference between representations of corresponding patches from each
domain. We evaluated the methods in transfer learning experiments on
multi-modal brain MRI data and on synthetic data. After transforming training
and test data from different modalities to the common representations learned
by our methods, we trained classifiers for each pair of modalities. We found
that adding the similarity term to the standard objective can produce
representations that are more similar and can give a higher accuracy in these
cross-modality classification experiments.
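The combined objective can be sketched as a sum of per-domain reconstruction losses plus a representation-similarity penalty; the weighting below is a hypothetical hyperparameter, not a value from the paper:

```python
def mse(a, b):
    # Mean squared error between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_loss(patch_a, recon_a, patch_b, recon_b,
                  rep_a, rep_b, weight=1.0):
    """Reconstruction terms for both domains plus a similarity term that
    penalises distance between the representations of corresponding
    patches.  `weight` is a hypothetical trade-off hyperparameter."""
    reconstruction = mse(patch_a, recon_a) + mse(patch_b, recon_b)
    similarity = mse(rep_a, rep_b)
    return reconstruction + weight * similarity

loss = combined_loss([1.0, 0.0], [1.0, 0.0],   # domain A patch + reconstruction
                     [0.0, 1.0], [0.0, 0.0],   # domain B patch + reconstruction
                     [0.5, 0.5], [0.5, 0.0])   # representations of the pair
```

Setting `weight` to zero recovers two independent autoencoders; increasing it pulls the two domains toward a common representation, which is what enables the cross-modality transfer.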
ExplainIt! -- A declarative root-cause analysis engine for time series data (extended version)
We present ExplainIt!, a declarative, unsupervised root-cause analysis engine
that uses time series monitoring data from large complex systems such as data
centres. ExplainIt! empowers operators to succinctly specify a large number of
causal hypotheses to search for causes of interesting events. ExplainIt! then
ranks these hypotheses, reducing the number of causal dependencies from
hundreds of thousands to a handful for human understanding. We show how a
declarative language, such as SQL, can be effective in declaratively
enumerating hypotheses that probe the structure of an unknown probabilistic
graphical causal model of the underlying system. Our thesis is that databases
are in a unique position to enable users to rapidly explore the possible causal
mechanisms in data collected from diverse sources. We empirically demonstrate
how ExplainIt! has helped us resolve over 30 performance issues in a commercial
product since late 2014, of which we discuss a few cases in detail.
Comment: SIGMOD Industry Track 201
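One crude stand-in for such hypothesis ranking is to score each candidate cause metric by its correlation with the symptom time series; ExplainIt!'s actual ranking over a probabilistic graphical causal model is more involved, so treat this only as an illustration of narrowing many hypotheses to a few:

```python
import math

def pearson(xs, ys):
    # Pearson correlation of two equal-length series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

def rank_hypotheses(symptom, candidates):
    """Order candidate cause metrics by |correlation| with the symptom."""
    return sorted(candidates,
                  key=lambda name: -abs(pearson(candidates[name], symptom)))

symptom = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "disk_io":  [1.1, 2.0, 2.9, 4.2],   # tracks the symptom closely
    "cpu_temp": [3.0, 1.0, 4.0, 1.5],   # mostly unrelated
}
ranking = rank_hypotheses(symptom, candidates)
```

The metric names here are hypothetical; in the system described, the candidate set is enumerated declaratively in SQL rather than listed by hand.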
Graph-based Features for Automatic Online Abuse Detection
While online communities have become increasingly important over the years,
the moderation of user-generated content is still performed mostly manually.
Automating this task is an important step in reducing the financial cost
associated with moderation, but the majority of automated approaches strictly
based on message content are highly vulnerable to intentional obfuscation. In
this paper, we discuss methods for extracting conversational networks based on
raw multi-participant chat logs, and we study the contribution of graph
features to a classification system that aims to determine if a given message
is abusive. The conversational graph-based system yields unexpectedly high
performance, with results comparable to those previously obtained with a
content-based approach.
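Extracting a conversational network from a raw chat log can be sketched by linking each message's author to the authors of the few preceding messages; the window size here is an assumption, not the paper's extraction rule:

```python
from collections import defaultdict

def build_conversation_graph(messages, window=2):
    """Link each message's author to the authors of the previous
    `window` messages -- one simple way to derive a conversational
    network from a multi-participant chat log."""
    edges = defaultdict(int)
    for i, (author, _text) in enumerate(messages):
        for prev_author, _ in messages[max(0, i - window):i]:
            if prev_author != author:
                edges[(author, prev_author)] += 1
    return edges

def degree(edges, node):
    # A basic graph feature usable alongside content-based features.
    return sum(1 for (a, b) in edges if node in (a, b))

log = [("alice", "hi"), ("bob", "hello"), ("carol", "hey"), ("alice", "?!")]
g = build_conversation_graph(log)
```

Features of this kind are harder to game than message content, since an abuser can obfuscate words but not easily reshape the conversation structure around them.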
Dynamic Control of Explore/Exploit Trade-Off In Bayesian Optimization
Bayesian optimization offers the possibility of optimizing black-box
functions that are not accessible through traditional techniques. The success
of Bayesian optimization methods such as Expected Improvement (EI) is
significantly affected by the degree of trade-off between exploration and
exploitation. Too much exploration can lead to inefficient optimization
protocols, whilst too much exploitation leaves the protocol open to strong
initial biases, and a high chance of getting stuck in a local minimum.
Typically, a constant margin is used to control this trade-off, which results
in yet another hyper-parameter to be optimized. We propose contextual
improvement as a simple yet effective heuristic to counter this, achieving a
one-shot optimization strategy. Our proposed heuristic can be swiftly
calculated and improves both the speed and robustness of discovery of optimal
solutions. We demonstrate its effectiveness on both synthetic and real world
problems and explore the unaccounted for uncertainty in the pre-determination
of search hyperparameters controlling the explore-exploit trade-off.
Comment: Accepted for publication in the proceedings of 2018 Computing
Conferenc
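The baseline that contextual improvement replaces can be sketched as standard EI (for minimisation) with a constant exploration margin; the contextual heuristic itself, which makes this margin depend on the optimization context, is not reproduced here:

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best, margin=0.0):
    """Standard EI for minimisation at a point with posterior mean `mu`
    and standard deviation `sigma`, given incumbent value `best`.
    A larger constant `margin` pushes toward more exploration."""
    if sigma <= 0.0:
        return 0.0
    z = (best - mu - margin) / sigma
    return (best - mu - margin) * norm_cdf(z) + sigma * norm_pdf(z)

# More posterior uncertainty -> more expected improvement, all else equal.
ei_low = expected_improvement(0.0, 0.1, 0.0)
ei_high = expected_improvement(0.0, 1.0, 0.0)
```

Because the margin is a fixed hyper-parameter here, tuning it reintroduces exactly the meta-optimization problem the paper's heuristic aims to remove.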
Large-scale diversity estimation through surname origin inference
The study of surnames as both linguistic and geographical markers of the past
has proven valuable in several research fields spanning from biology and
genetics to demography and social mobility. This article builds upon the
existing literature to conceive and develop a surname origin classifier based
on a data-driven typology. This enables us to explore a methodology to describe
large-scale estimates of the relative diversity of social groups, especially
when such data are scarce. We subsequently analyze the
representativeness of surname origins for 15 socio-professional groups in
France.
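A data-driven surname-origin classifier can be sketched with character n-gram profiles per origin class; the tiny profiles below are purely illustrative, not real training data or the paper's typology:

```python
from collections import Counter

def char_ngrams(surname, n=3):
    # "^" and "$" mark word boundaries, so prefixes/suffixes get their
    # own n-grams (suffixes like "-ski" are strong origin signals).
    s = f"^{surname.lower()}$"
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def classify(surname, profiles):
    """Score each origin by overlap between the surname's character
    trigrams and a per-origin trigram profile; return the best origin."""
    grams = Counter(char_ngrams(surname))
    scores = {origin: sum((grams & Counter(profile)).values())
              for origin, profile in profiles.items()}
    return max(scores, key=scores.get)

profiles = {
    "italian": ["ini", "ni$", "^ro", "ssi", "si$"],
    "polish":  ["ski", "ki$", "wsk", "^ko", "czy"],
}
origin = classify("kowalski", profiles)
```

Aggregating such per-surname predictions over a population then yields the kind of group-level diversity estimate the article studies.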
The Poisson-Boltzmann model for implicit solvation of electrolyte solutions: Quantum chemical implementation and assessment via Sechenov coefficients.
We present the theory and implementation of a Poisson-Boltzmann implicit
solvation model for electrolyte solutions. This model can be combined with
arbitrary electronic structure methods that provide an accurate charge density
of the solute. A hierarchy of approximations for this model includes a linear
approximation for weak electrostatic potentials, the finite size of the mobile
electrolyte ions, and a Stern-layer correction. Recasting the
Poisson-Boltzmann equations into Euler-Lagrange equations then significantly
simplifies the derivation of the free energy of solvation for these
approximate models. The parameters of the model are either fit directly to
experimental observables (e.g., the finite ion size) or optimized for
agreement with experimental results. Experimental data for this optimization
are available in the form of Sechenov coefficients that describe the linear
dependence of the salting-out effect of solutes with respect to the
electrolyte concentration. In the final part, we rationalize the qualitative
disagreement of the finite-ion-size modification to the Poisson-Boltzmann
model with experimental observations by taking into account the
electrolyte-concentration dependence of the Stern layer. A route toward a
revised model that captures the experimental observations while including
finite-ion-size effects is then outlined. This implementation paves the way
for the study of electrochemical and electrocatalytic processes of molecules
and cluster models with accurate electronic structure methods.
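The Sechenov relation that anchors the parameter fit can be written down directly: log10(S0/S) = K_s * c_salt, so solubility falls exponentially with electrolyte concentration for K_s > 0. The coefficient value below is illustrative, not a fitted result from the paper:

```python
def salting_out_solubility(s0, sechenov_k, salt_conc):
    """Sechenov relation: log10(S0/S) = K_s * c_salt, where S0 is the
    solubility in pure solvent, S the solubility at electrolyte
    concentration c_salt, and K_s the Sechenov coefficient."""
    return s0 * 10.0 ** (-sechenov_k * salt_conc)

# Illustrative numbers: K_s = 0.1 L/mol at 1 mol/L salt gives roughly a
# 21% solubility drop relative to the pure solvent.
s = salting_out_solubility(1.0, 0.1, 1.0)
```

The linearity of log10(S0/S) in the salt concentration is what makes measured Sechenov coefficients a convenient one-parameter target for optimizing the implicit-solvation model.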